On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples
Authors
Abstract
Sample selection bias is a common problem encountered when applying data mining algorithms to many real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so-called “stationary or non-biased distribution assumption.” However, this assumption is often violated in practice. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, and school enrollment. For these applications, the only labeled data available for training is a biased representation, in various ways, of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the testing data, and vice versa. When this happens, an inductive model constructed from a biased training set may not be as accurate on unbiased testing data as it would have been had there been no selection bias in the training data. In this paper, we first improve and clarify a previously proposed categorization of sample selection bias. In particular, we show that, except under very restricted conditions, sample selection bias is a common problem in many real-world situations. We then analyze various effects of sample selection bias on inductive modeling, in particular, how the “true” conditional probability P(y|x) to be modeled by inductive learners can be misrepresented in the biased training data, which subsequently misleads a learning algorithm. To solve inaccuracy problems due to sample selection bias, we explore how to use model averaging of (1) conditional probabilities P(y|x), (2) feature probabilities P(x), and (3) joint probabilities P(x, y) to reduce the influence of sample selection bias on model accuracy. In particular, we explore how to use unlabeled data in a semi-supervised learning framework to improve the accuracy of descriptive models constructed from biased training samples.
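The two ideas in the abstract, distortion of P(y|x) under selection and averaging of conditional probability estimates, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the binary label, and the selection indicator s (s = 1 meaning an example enters the training set) are assumptions made for illustration only. The first function applies Bayes' rule to obtain P(y=1|x, s=1) from P(y=1|x) and the selection rates P(s=1|x, y); the second averages the P(y=1|x) estimates of several models.

```python
from typing import Callable, Sequence

# Hypothetical type: a model maps a feature value x to an estimate of P(y=1|x).
ProbModel = Callable[[float], float]


def biased_conditional(p_y1: float, sel_y1: float, sel_y0: float) -> float:
    """Return P(y=1 | x, s=1) by Bayes' rule.

    p_y1   : true P(y=1 | x)
    sel_y1 : P(s=1 | x, y=1), selection rate for positive examples
    sel_y0 : P(s=1 | x, y=0), selection rate for negative examples
    """
    numerator = sel_y1 * p_y1
    denominator = numerator + sel_y0 * (1.0 - p_y1)
    if denominator == 0.0:
        raise ValueError("no example with this x can be selected")
    return numerator / denominator


def average_conditional(models: Sequence[ProbModel], x: float) -> float:
    """Return the unweighted average of the models' P(y=1|x) estimates."""
    if not models:
        raise ValueError("need at least one model")
    return sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    # Selection independent of y given x: the conditional is unchanged.
    print(biased_conditional(0.2, 0.5, 0.5))
    # Selection favours positives: the training data overstates P(y=1|x).
    print(biased_conditional(0.2, 0.9, 0.1))
    # Averaging three illustrative constant estimators.
    estimators = [lambda x: 0.6, lambda x: 0.8, lambda x: 0.7]
    print(average_conditional(estimators, x=1.0))
```

When the selection rates do not depend on y, the sketch returns the true conditional unchanged; when they do, the value seen in the training data shifts toward the over-sampled class.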
IBM T.J. Watson Research Center, Hawthorne, NY 10532; Department of Computer Science, University at Albany, State University of New York, Albany, NY 12222
Similar resources
Type-Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing
Sample selection bias is a common problem in many real world applications, where training data are obtained under realistic constraints that make them follow a different distribution from the future testing data. For example, in the application of hospital clinical studies, it is common practice to build models from the eligible volunteers as the training data, and then apply the model to the e...
Designing a model of intuitionistic fuzzy VIKOR in multi-attribute group decision-making problems
Multiple attributes group decision making (MAGDM) is regarded as the process of determining the best feasible solution by a group of experts or decision makers according to the attributes that represent different effects. In assessing the performance of each alternative with respect to each attribute and the relative importance of the selected attributes, quantitative/qualitative evaluations ar...
When Efficient Model Averaging Out-Performs Boosting and Bagging
Bayesian model averaging, also known as the Bayes optimal classifier (BOC), is an ensemble technique used extensively in the statistics literature. However, compared to other ensemble techniques such as bagging and boosting, BOC is less known and rarely used in data mining. This is partly due to model averaging being perceived as inefficient and because bagging and boosting consistently out...
Nearest Neighbor Density Ratio Estimation for Large-Scale Applications in Astronomy
In astronomical applications of machine learning, the distribution of objects used for building a model is often different from the distribution of the objects the model is later applied to. This is known as sample selection bias, which is a major challenge for statistical inference as one can no longer assume that the labeled training data are representative. To address this issue, one can re-...
On the Bias of the Maximum Likelihood Estimator for the Two-Parameter Lomax Distribution
The Lomax (Pareto II) distribution has found wide application in a variety of fields. We analyze the second-order bias of the maximum likelihood estimators of its parameters for finite sample sizes, and show that this bias is positive. We derive an analytic bias correction which reduces the percentage bias of these estimators by one or two orders of magnitude, while simultaneously reducing rela...
Publication date: 2007